Getting Started

  • 作业介绍
    • recitation-测试
    • homework-改进代码
  • 提前设置
    make <target> DEBUG=1
    

Recitation: Perf and Cachegrind

Perf

安装 Perf

出现了问题,已解决,但是没有记录

Perf 个人使用速查

perf record
用法:
$ perf record <program_name> <program_arguments>

running ./isort n k will sort an array of n elements k times.

$ sudo perf record ./isort 10000 10
Sorting 10000 values...
Done!
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.035 MB perf.data (887 samples) ]
perf report
用法:直接perf report
$ sudo perf report
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 887  of event 'cpu-clock:pppH'
# Event count (approx.): 221750000
#
# Overhead  Command  Shared Object      Symbol
# ........  .......  .................  .................
#
    99.44%  isort    isort              [.] isort
     0.23%  isort    libc-2.27.so       [.] rand_r
     0.11%  isort    [kernel.kallsyms]  [k] queue_work_on
     0.11%  isort    [kernel.kallsyms]  [k] release_pages
     0.11%  isort    isort              [.] main

这里就可以看到 isort 占据了很大一部分时间

Cachegrind

Cachegrind (a Valgrind tool) is a cache and branch-prediction profiler On virtual environments like those on AWS, hardware events providing information about branches and cache misses are often unavailable, so perf may not be helpful.

valgrind
使用:
$ valgrind --tool=cachegrind --branch-sim=yes <program_name> <program_arguments>
输出解释:
D1 represents the lowest-level cache (L1)
LL represents the last (highest) level data cache (on most machines, L3)

例子代码

// Copyright (c) 2012 MIT License by 6.172 Staff

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t data_t;
const int U = 10000000;   // size of the array. 10 million vals ~= 40MB
const int N = 100000000;  // number of searches to perform

int main() {
  data_t* data = (data_t*) malloc(U * sizeof(data_t));
  if (data == NULL) {
    free(data);
    printf("Error: not enough memory\n");
    exit(-1);
  }

  // fill up the array with sequential (sorted) values.
  int i;
  for (i = 0; i < U; i++) {
    data[i] = i;
  }

  printf("Allocated array of size %d\n", U);
  printf("Summing %d random values...\n", N);

  data_t val = 0;
  data_t seed = 42;
  for (i = 0; i < N; i++) {
    int l = rand_r(&seed) % U;
    val = (val + data[l]);
  }

  free(data);
  printf("Done. Value = %d\n", val);
  return 0;
}

编译

$ make sum

剖析

$ valgrind --tool=cachegrind --branch-sim=yes ./sum

输出

==125== Cachegrind, a cache and branch-prediction profiler
==125== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==125== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==125== Command: ./sum
==125==
--125-- warning: L3 cache found, using its data for the LL simulation.
Allocated array of size 10000000
Summing 100000000 random values...
Done. Value = 938895920
==125==
==125== I   refs:      3,440,227,630
==125== I1  misses:            1,195
==125== LLi misses:            1,180
==125== I1  miss rate:          0.00%
==125== LLi miss rate:          0.00%
==125==
==125== D   refs:        610,074,405  (400,058,343 rd   + 210,016,062 wr)
==125== D1  misses:      100,507,416  ( 99,881,550 rd   +     625,866 wr)
==125== LLd misses:       37,859,997  ( 37,234,195 rd   +     625,802 wr)
==125== D1  miss rate:          16.5% (       25.0%     +         0.3%  )
==125== LLd miss rate:           6.2% (        9.3%     +         0.3%  )
==125==
==125== LL refs:         100,508,611  ( 99,882,745 rd   +     625,866 wr)
==125== LL misses:        37,861,177  ( 37,235,375 rd   +     625,802 wr)
==125== LL miss rate:            0.9% (        1.0%     +         0.3%  )
==125==
==125== Branches:        210,043,840  (110,043,336 cond + 100,000,504 ind)
==125== Mispredicts:           5,456  (      5,248 cond +         208 ind)
==125== Mispred rate:            0.0% (        0.0%     +         0.0%   )

查找 Cache 信息

lscpu
用法:
$ lscpu
作用:
find information about your CPU and its caches
$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              20
On-line CPU(s) list: 0-19
Thread(s) per core:  2
Core(s) per socket:  10
Socket(s):           1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               154
Model name:          12th Gen Intel(R) Core(TM) i7-12700H
Stepping:            3
CPU MHz:             2687.998
BogoMIPS:            5375.99
Virtualization:      VT-x
Hypervisor vendor:   Microsoft
Virtualization type: full
L1d cache:           48K
L1i cache:           32K
L2 cache:            1280K
L3 cache:            24576K
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves umip waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm serialize flush_l1d arch_capabilities

在我们这里就可以看到 L1、L2 和 L3 的 Cache 信息了

再观察输出结果,因为我们写 Cache miss 少(代码顺序写)和读 Cache miss 多(代码随机读) 然后我们通过改变 U 和 N,控制 U 小于 L1、L2、L3 Cache 等情况,就能看出 N、U 大小和 D、L、LLD Cache miss 的关系

Homework: Sorting

Write-up 1:

Compare the Cachegrind output on the DEBUG=1 code versus DEBUG=0 compiler optimized code. Explain the advantages and disadvantages of using instruction count as a substitute for time when you compare the performance of different versions of this program

$ make
clang main.c tests.c util.c isort.c sort_a.c sort_c.c sort_i.c sort_p.c sort_m.c sort_f.c -O3 -DNDEBUG -g -Wall -std=gnu99 -gdwarf-3 -always-inline -lrt -lm  -o sort
clang: warning: argument unused during compilation: '-always-inline' [-Wunused-command-line-argument]

$ valgrind --tool=cachegrind --branch-sim=yes ./sort 10000 10
==184== Cachegrind, a cache and branch-prediction profiler
==184== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==184== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==184== Command: ./sort 10000 10
==184==
--184-- warning: L3 cache found, using its data for the LL simulation.

Running test #0...
Generating random array of 10000 elements
Arrays are sorted: yes
 --> test_correctness at line 217: PASS
sort_a          : Elapsed execution time: 0.014364 sec
sort_a repeated : Elapsed execution time: 0.013956 sec
Generating inverted array of 10000 elements
Arrays are sorted: yes
 --> test_correctness at line 217: PASS
sort_a          : Elapsed execution time: 0.027513 sec
sort_a repeated : Elapsed execution time: 0.027134 sec

Running test #1...
 --> test_zero_element at line 245: PASS

Running test #2...
 --> test_one_element at line 266: PASS
Done testing.
==184==
==184== I   refs:      235,925,874
==184== I1  misses:          1,586
==184== LLi misses:          1,467
==184== I1  miss rate:        0.00%
==184== LLi miss rate:        0.00%
==184==
==184== D   refs:       87,546,459  (52,667,725 rd   + 34,878,734 wr)
==184== D1  misses:        228,442  (   127,084 rd   +    101,358 wr)
==184== LLd misses:          5,123  (     2,420 rd   +      2,703 wr)
==184== D1  miss rate:         0.3% (       0.2%     +        0.3%  )
==184== LLd miss rate:         0.0% (       0.0%     +        0.0%  )
==184==
==184== LL refs:           230,028  (   128,670 rd   +    101,358 wr)
==184== LL misses:           6,590  (     3,887 rd   +      2,703 wr)
==184== LL miss rate:          0.0% (       0.0%     +        0.0%  )
==184==
==184== Branches:       40,474,926  (38,773,975 cond +  1,700,951 ind)
==184== Mispredicts:     2,484,878  ( 2,484,522 cond +        356 ind)
==184== Mispred rate:          6.1% (       6.4%     +        0.0%   )

$ make clean
rm -f ./sort *.std* *.gcov *.gcda *.gcno default.profraw

$ make DEBUG=1
clang main.c tests.c util.c isort.c sort_a.c sort_c.c sort_i.c sort_p.c sort_m.c sort_f.c -DDEBUG -O0 -g -Wall -std=gnu99 -gdwarf-3 -always-inline -lrt -lm  -o sort
clang: warning: argument unused during compilation: '-always-inline' [-Wunused-command-line-argument]

$  valgrind --tool=cachegrind --
ranch-sim=yes ./sort 10000 10valgrind: no program specified
valgrind: Use --help for more information.
cjl@ChenJulian:~/solution/mit6.172/Homework/HW2/homework$  valgrind --tool=cachegrind --branch-sim=yes ./sort 10000 10
==206== Cachegrind, a cache and branch-prediction profiler
==206== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==206== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==206== Command: ./sort 10000 10
==206==
--206-- warning: L3 cache found, using its data for the LL simulation.

Running test #0...
Generating random array of 10000 elements
Arrays are sorted: yes
 --> test_correctness at line 217: PASS
sort_a          : Elapsed execution time: 0.025243 sec
sort_a repeated : Elapsed execution time: 0.025381 sec
Generating inverted array of 10000 elements
Arrays are sorted: yes
 --> test_correctness at line 217: PASS
sort_a          : Elapsed execution time: 0.049681 sec
sort_a repeated : Elapsed execution time: 0.049848 sec

Running test #1...
 --> test_zero_element at line 245: PASS

Running test #2...
 --> test_one_element at line 266: PASS
Done testing.
==206==
==206== I   refs:      408,412,706
==206== I1  misses:          1,553
==206== LLi misses:          1,457
==206== I1  miss rate:        0.00%
==206== LLi miss rate:        0.00%
==206==
==206== D   refs:      260,697,695  (194,384,270 rd   + 66,313,425 wr)
==206== D1  misses:        228,025  (    126,767 rd   +    101,258 wr)
==206== LLd misses:          5,115  (      2,411 rd   +      2,704 wr)
==206== D1  miss rate:         0.1% (        0.1%     +        0.2%  )
==206== LLd miss rate:         0.0% (        0.0%     +        0.0%  )
==206==
==206== LL refs:           229,578  (    128,320 rd   +    101,258 wr)
==206== LL misses:           6,572  (      3,868 rd   +      2,704 wr)
==206== LL miss rate:          0.0% (        0.0%     +        0.0%  )
==206==
==206== Branches:       45,174,708  ( 43,473,813 cond +  1,700,895 ind)
==206== Mispredicts:     3,109,490  (  3,109,160 cond +        330 ind)
==206== Mispred rate:          6.9% (        7.2%     +        0.0%   )

得出结果

DEBUG 0 1
时间
指令数
Cache Misses 率
MISpredicts 率

O0 牺牲了 cache 命中率,但是指令数少,时间快

inlining

You would like to see how much inline functions can help. Copy over the code from sort_a.c into sort_i.c, and change all the routine names from _a to _i. Using the inline keyword, inline one or more of the functions in sort_i.c and util.c. To add the sort_i routine to the testing suite, uncomment the line in main.c, under testFunc, that specifies sort_i. Profile and annotate the inlined program.

Write-up 2:

Explain which functions you chose to inline and report the performance differences you observed between the inlined and uninlined sorting routines.

如题,将 sort_i 中的 merge_i 和 copy_i 设置为 inline。并进行测试

编译和 perf record 的过程不再给出

  • no use inlining & DEBUG=1
cjl@ChenJulian:/mnt/c/Data Files/mit6.172/Homework/HW2/homework$ perf record ./sort 10000 10

Running test #0...
Generating random array of 10000 elements
Arrays are sorted: yes
 --> test_correctness at line 217: PASS
sort_a          : Elapsed execution time: 0.000694 sec
sort_a repeated : Elapsed execution time: 0.000679 sec
sort_i          : Elapsed execution time: 0.000679 sec
Generating inverted array of 10000 elements
Arrays are sorted: yes
 --> test_correctness at line 217: PASS
sort_a          : Elapsed execution time: 0.000950 sec
sort_a repeated : Elapsed execution time: 0.000989 sec
sort_i          : Elapsed execution time: 0.000942 sec

Running test #1...
 --> test_zero_element at line 245: PASS

Running test #2...
 --> test_one_element at line 266: PASS
Done testing.
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.006 MB perf.data (116 samples) ]
cjl@ChenJulian:/mnt/c/Data Files/mit6.172/Homework/HW2/homework$ perf report
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 116  of event 'cpu-clock:uhpppH'
# Event count (approx.): 29000000
#
# Overhead  Command  Shared Object  Symbol
# ........  .......  .............  ...........................
#
    43.10%  sort     sort           [.] sort_a
    17.24%  sort     libc-2.27.so   [.] cfree@GLIBC_2.2.5
    16.38%  sort     libc-2.27.so   [.] malloc
    15.52%  sort     sort           [.] sort_i
     2.59%  sort     sort           [.] mem_alloc
     1.72%  sort     sort           [.] mem_free
     0.86%  sort     libc-2.27.so   [.] intel_check_word.isra.0
     0.86%  sort     libc-2.27.so   [.] rand_r
     0.86%  sort     sort           [.] free@plt
     0.86%  sort     sort           [.] malloc@plt


#
# (Tip: Show current config key-value pairs: perf config --list)
#
cjl@ChenJulian:/mnt/c/Data Files/mit6.172/Homework/HW2/homework$ valgrind --tool=cachegrind --branch-sim=yes ./sort 10000 10
==698== Cachegrind, a cache and branch-prediction profiler
==698== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==698== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==698== Command: ./sort 10000 10
==698==
--698-- warning: L3 cache found, using its data for the LL simulation.

Running test #0...
Generating random array of 10000 elements
Arrays are sorted: yes
 --> test_correctness at line 217: PASS
sort_a          : Elapsed execution time: 0.014621 sec
sort_a repeated : Elapsed execution time: 0.014605 sec
sort_i          : Elapsed execution time: 0.014851 sec
Generating inverted array of 10000 elements
Arrays are sorted: yes
 --> test_correctness at line 217: PASS
sort_a          : Elapsed execution time: 0.028227 sec
sort_a repeated : Elapsed execution time: 0.028304 sec
sort_i          : Elapsed execution time: 0.028753 sec

Running test #1...
 --> test_zero_element at line 245: PASS

Running test #2...
 --> test_one_element at line 266: PASS
Done testing.
==698==
==698== I   refs:      351,917,014
==698== I1  misses:          1,624
==698== LLi misses:          1,494
==698== I1  miss rate:        0.00%
==698== LLi miss rate:        0.00%
==698==
==698== D   refs:      130,160,086  (78,415,144 rd   + 51,744,942 wr)
==698== D1  misses:        324,704  (   184,171 rd   +    140,533 wr)
==698== LLd misses:          5,123  (     2,421 rd   +      2,702 wr)
==698== D1  miss rate:         0.2% (       0.2%     +        0.3%  )
==698== LLd miss rate:         0.0% (       0.0%     +        0.0%  )
==698==
==698== LL refs:           326,328  (   185,795 rd   +    140,533 wr)
==698== LL misses:           6,617  (     3,915 rd   +      2,702 wr)
==698== LL miss rate:          0.0% (       0.0%     +        0.0%  )
==698==
==698== Branches:       60,182,795  (57,681,766 cond +  2,501,029 ind)
==698== Mispredicts:     3,703,747  ( 3,703,346 cond +        401 ind)
==698== Mispred rate:          6.2% (       6.4%     +        0.0%   )
  • use inlining & DEBUG=1
$ cjl@ChenJulian:/mnt/c/Data Files/mit6.172/Homework/HW2/homework$ perf record ./sort 10000 10

Running test #0...
Generating random array of 10000 elements
Arrays are sorted: yes
 --> test_correctness at line 217: PASS
sort_a          : Elapsed execution time: 0.001043 sec
sort_a repeated : Elapsed execution time: 0.001049 sec
sort_i          : Elapsed execution time: 0.001048 sec
Generating inverted array of 10000 elements
Arrays are sorted: yes
 --> test_correctness at line 217: PASS
sort_a          : Elapsed execution time: 0.001652 sec
sort_a repeated : Elapsed execution time: 0.001609 sec
sort_i          : Elapsed execution time: 0.001595 sec

Running test #1...
 --> test_zero_element at line 245: PASS

Running test #2...
 --> test_one_element at line 266: PASS
Done testing.
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.009 MB perf.data (203 samples) ]
cjl@ChenJulian:/mnt/c/Data Files/mit6.172/Homework/HW2/homework$ perf report
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 203  of event 'cpu-clock:uhpppH'
# Event count (approx.): 50750000
#
# Overhead  Command  Shared Object  Symbol
# ........  .......  .............  ................................
#
    34.98%  sort     sort           [.] merge_a
    15.27%  sort     sort           [.] merge_i
    12.32%  sort     sort           [.] copy_a
     9.85%  sort     libc-2.27.so   [.] cfree@GLIBC_2.2.5
     8.87%  sort     sort           [.] copy_i
     5.42%  sort     sort           [.] sort_a
     3.45%  sort     libc-2.27.so   [.] malloc
     1.97%  sort     sort           [.] copy_data
     1.97%  sort     sort           [.] mem_alloc
     1.97%  sort     sort           [.] sort_i
     0.99%  sort     sort           [.] init_data
     0.99%  sort     sort           [.] mem_free
     0.49%  sort     ld-2.27.so     [.] strcmp
     0.49%  sort     libc-2.27.so   [.] __fxstat64
     0.49%  sort     libc-2.27.so   [.] __memmove_avx_unaligned_erms
     0.49%  sort     sort           [.] malloc@plt


#
# (Tip: Customize output of perf script with: perf script -F event,ip,sym)
#
cjl@ChenJulian:/mnt/c/Data Files/mit6.172/Homework/HW2/homework$ valgrind --tool=cachegrind --branch-sim=yes ./sort 10000 10
==758== Cachegrind, a cache and branch-prediction profiler
==758== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==758== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==758== Command: ./sort 10000 10
==758==
--758-- warning: L3 cache found, using its data for the LL simulation.

Running test #0...
Generating random array of 10000 elements
Arrays are sorted: yes
 --> test_correctness at line 217: PASS
sort_a          : Elapsed execution time: 0.025676 sec
sort_a repeated : Elapsed execution time: 0.025433 sec
sort_i          : Elapsed execution time: 0.025869 sec
Generating inverted array of 10000 elements
Arrays are sorted: yes
 --> test_correctness at line 217: PASS
sort_a          : Elapsed execution time: 0.050344 sec
sort_a repeated : Elapsed execution time: 0.050162 sec
sort_i          : Elapsed execution time: 0.050949 sec

Running test #1...
 --> test_zero_element at line 245: PASS

Running test #2...
 --> test_one_element at line 266: PASS
Done testing.
==758==
==758== I   refs:      608,332,662
==758== I1  misses:          1,574
==758== LLi misses:          1,481
==758== I1  miss rate:        0.00%
==758== LLi miss rate:        0.00%
==758==
==758== D   refs:      388,800,501  (289,890,848 rd   + 98,909,653 wr)
==758== D1  misses:        323,971  (    183,785 rd   +    140,186 wr)
==758== LLd misses:          5,122  (      2,419 rd   +      2,703 wr)
==758== D1  miss rate:         0.1% (        0.1%     +        0.1%  )
==758== LLd miss rate:         0.0% (        0.0%     +        0.0%  )
==758==
==758== LL refs:           325,545  (    185,359 rd   +    140,186 wr)
==758== LL misses:           6,603  (      3,900 rd   +      2,703 wr)
==758== LL miss rate:          0.0% (        0.0%     +        0.0%  )
==758==
==758== Branches:       67,436,323  ( 64,935,328 cond +  2,500,995 ind)
==758== Mispredicts:     4,682,231  (  4,681,844 cond +        387 ind)
==758== Mispred rate:          6.9% (        7.2%     +        0.0%   )

我们可以看到内联函数会导致时间花费更长。

在这个写作任务中,你需要解释递归函数内联化可能导致的性能下降,并说明使用 Cachegrind 收集的分析数据如何帮助你测量这些负面性能影响。

递归函数内联化的潜在性能问题包括:

  1. 代码膨胀:内联递归函数会导致代码膨胀,因为每次递归调用都会展开为相应的代码,这可能会导致生成更多的指令。代码膨胀可能会增加指令缓存的压力,降低缓存命中率。
  2. 栈消耗:递归函数通常使用函数调用栈来保存每个递归调用的状态。内联递归函数可能会导致栈消耗过大,尤其在递归深度较大时。较大的栈消耗可能会导致栈溢出或者减慢程序的执行速度。
  3. 冗余计算:内联展开递归函数的过程中,可能会进行一些冗余计算,因为相同的计算可能在不同的展开代码中多次出现。这会增加指令执行的开销,降低程序的效率。

Pointers vs Arrays

Write-up 4

Give a reason why using pointers may improve performance. Report on any performance differences you observed in your implementation.

指针

Coarsening

Write-up 5

Explain what sorting algorithm you used and how you chose the number of elements to be sorted in the base case. Report on the performance differences you observed.

  • 未优化

代码

static inline void merge_m(data_t *A, int p, int q, int r)
{
  assert(A);
  assert(p <= q);
  assert((q + 1) <= r);
  int n1 = q - p + 1;
  int n2 = r - q;

  data_t *left = 0, *right = 0;
  mem_alloc(&left, n1 + 1);
  mem_alloc(&right, n2 + 1);
  if (left == NULL || right == NULL)
  {
    mem_free(&left);
    mem_free(&right);
    return;
  }

  copy_m(&(A[p]), left, n1);
  copy_m(&(A[q + 1]), right, n2);
  left[n1] = UINT_MAX;
  right[n2] = UINT_MAX;

  int i = 0;
  int j = 0;

  for (int k = p; k <= r; k++)
  {
    if (left[i] <= right[j])
    {
      A[k] = left[i];
      i++;
    }
    else
    {
      A[k] = right[j];
      j++;
    }
  }
  mem_free(&left);
  mem_free(&right);
}
cjl@ChenJulian:/mnt/c/Data Files/mit6.172/Homework/HW2/homework$ valgrind --tool=cachegrind --branch-sim=yes ./sort 10000 10
==41== Cachegrind, a cache and branch-prediction profiler
==41== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==41== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==41== Command: ./sort 10000 10
==41==
--41-- warning: L3 cache found, using its data for the LL simulation.

Running test #0...
Generating random array of 10000 elements
Arrays are sorted: yes
 --> test_correctness at line 217: PASS
sort_m          : Elapsed execution time: 0.031639 sec
Generating inverted array of 10000 elements
Arrays are sorted: yes
 --> test_correctness at line 217: PASS
sort_m          : Elapsed execution time: 0.062058 sec

Running test #1...
 --> test_zero_element at line 245: PASS

Running test #2...
 --> test_one_element at line 266: PASS
Done testing.
==41==
==41== I   refs:      208,493,318
==41== I1  misses:          1,554
==41== LLi misses:          1,458
==41== I1  miss rate:        0.00%
==41== LLi miss rate:        0.00%
==41==
==41== D   refs:      132,595,246  (98,877,918 rd   + 33,717,328 wr)
==41== D1  misses:        131,781  (    69,494 rd   +     62,287 wr)
==41== LLd misses:          5,116  (     2,413 rd   +      2,703 wr)
==41== D1  miss rate:         0.1% (       0.1%     +        0.2%  )
==41== LLd miss rate:         0.0% (       0.0%     +        0.0%  )
==41==
==41== LL refs:           133,335  (    71,048 rd   +     62,287 wr)
==41== LL misses:           6,574  (     3,871 rd   +      2,703 wr)
==41== LL miss rate:          0.0% (       0.0%     +        0.0%  )
==41==
==41== Branches:       22,912,991  (22,012,175 cond +    900,816 ind)
==41== Mispredicts:     1,558,061  ( 1,557,736 cond +        325 ind)
==41== Mispred rate:          6.8% (       7.1%     +        0.0%   )
  • 优化后:

报错,感觉有点花费时间。先不细究了。

results matching ""

    No results matching ""